{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-04-11T03:26:17.001160Z",
     "start_time": "2018-04-11T03:26:16.996236Z"
    }
   },
   "source": [
    "# Notebooks\n",
    "\n",
    "0) **README.ipynb** this notebook  \n",
    "1) **FunctionsAsVectors.ipynb** : Demonstration of the fourier orthonormal basis for functions.  \n",
    "2) **PCA_computation per state.ipynb** : Computing the PCA and other statistics for data from a single state.  \n",
    "3) **Weather Analysis - Initial Visualisation.ipynb** : Visualizing the statistics.  \n",
    "3a) **Weather Analysis - Visualisation after smoothing.ipynb** : Same as (3) but using smoothed signals.  \n",
    "4) **Weather Analysis - reconstruction SNWD.ipynb** : Reconstruction of snow depth using top eigenvector: generates reconstruction file that is used by (5)  \n",
    "4a) **Weather Analysis - reconstruction PRCP-Smoothed.ipynb** : Same as (4) but for smoothed percipitation (*PRCP_s20*)  \n",
    "5) **Maps using iPyLeaflet.ipynb** : Plotting information about stations on an interactive map.  \n",
    "6) **Is SNWD variation spatial or temporal?.ipynb** : Using the \"variance explained\" criteria to decide whether space or time has a bigger effect on a coefficient.  \n",
    "7) **Smoothing.ipynb** : Code for smoothing the measurements across days.  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-04-11T03:26:17.001160Z",
     "start_time": "2018-04-11T03:26:16.996236Z"
    }
   },
   "source": [
    "# Files / Tables\n",
    "\n",
    "## Readme file describing the original data\n",
    "[download readme file from s3](https://mas-dse-open.s3.amazonaws.com/Weather/Info/ghcnd-readme.txt)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-04-11T03:26:17.001160Z",
     "start_time": "2018-04-11T03:26:16.996236Z"
    }
   },
   "source": [
    "## Source Data\n",
    "\n",
    "The source data is stored in parquet files. `ALL.parquet` contains all of the data, `NY.parquet` includes the data for all stations in New-York State. In the following sections we use NY state, but instead you can use any of:\n",
    "```\n",
    "AB,AL,AR,AZ,BC,CA,CO,CT,DC,DE,FL,GA,IA,ID,IL,IN,KS,KY,LA,MA,MB,MD,ME,MI,MN,MO,MS,MT,NaN,NB,NC,ND,NE,NH,NJ,NM,NV,NY,OH,OK,ON,OR,PA,QC,RI,SC,SD,SK,TN,TX,UT,VA,VT,WA,WI,WV,WY\n",
    "```\n",
    "Most of these states are in the continential USA - those contain the complete data. Others are states in Canada and Mexico and have only partial data, Finally, `NaN` contains measurements from stations that are not in any state (and therefor also not in the US).\n",
    "\n",
    "\n",
    "### Schema\n",
    "\n",
    "* **Station:** Station ID\n",
    "* **Measurement:** Type of measurement. Is one of:\n",
    "  * *TMIN* Minimal Temperature during the day.\n",
    "  * *TMAX* Maximal Temperature During the day.\n",
    "  * *TOBS* \"Observed\" temperature. Exact meaning not clear, but is less noisy than *TMIN* and *TMAX*\n",
    "  * *PRCP* Total percipitation during the day.\n",
    "  * *SNOW* Amount of snow that fell during the day.\n",
    "  * *SNWD* Snow Depth\n",
    "  * *TMIN_s20*, *TMAX_s20*,... A smoothed version of the six raw measurements.\n",
    "* **Year**\n",
    "* **Values:** a byte array of length 2*365 representing 365 floats (np.float16)\n",
    "* **State**\n",
    "\n",
    "|    Station|Measurement|Year|              Values|State|\n",
    "|-----------|-----------|----|--------------------|-----|\n",
    "|USC00303452|       PRCP|1903|[00 7E 00 7E 00 7...|   NY|\n",
    "|USC00303452|       PRCP|1904|[00 00 28 5B 00 0...|   NY|\n",
    "|USC00303452|       PRCP|1905|[00 00 60 56 60 5...|   NY|\n",
    "|USC00303452|       PRCP|1906|[00 00 00 00 00 0...|   NY|\n",
    "|USC00303452|       PRCP|1907|[00 00 00 00 60 5...|   NY|\n",
    "\n",
    "### Reading measurement data into a dataframe\n",
    "\n",
    "#### When using your own computer\n",
    "The files are stored on AWS as compressed tar files. The bucket `mas-dse-open` can be accessed through an HTTP connection. \n",
    "\n",
    "1) Download all of the measurements for a particular state (here NY) or ALL for the file that contains all of the states\n",
    "\n",
    "\n",
    "> curl https://mas-dse-open.s3.amazonaws.com/Weather/by_state_2/NY.tgz \n",
    "  \\> `data_dir`/NY.tgz\n",
    "\n",
    "> Where `data_dir` is the local data directory, here `big-data-analytics-using-spark/notebooks/Data/Weather`\n",
    "\n",
    "2) Untar the tar file \n",
    "\n",
    "> tar -xzf `data_dir`/NY.tgz\n",
    "\n",
    "> Creates the parquet directory `data_dir`/NY.parquet\n",
    "\n",
    "3) Read the parquet file into a dataframe:\n",
    "\n",
    "> df=sqlContext.read.parquet(`data_dir`/NY.parquet)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-04-11T03:26:17.001160Z",
     "start_time": "2018-04-11T03:26:16.996236Z"
    }
   },
   "source": [
    "## Station information\n",
    "\n",
    "Information about each station in the continental united states:\n",
    "\n",
    "### Schema\n",
    "* **Station:** Station ID.\n",
    "* **Dist\\_coast:** Distance from the coast (shoreline) (units? 1.4 of this  per mile?)\n",
    "* **Latitude**\n",
    "* **Longitude**\n",
    "* **Elevation** in meters, missing = -999.9\n",
    "* **Name:** the name of the station.\n",
    "\n",
    "|    Station|Dist_coast|Latitude|Longitude|Elevation|State|            Name|\n",
    "|-----------|----------|--------|---------|---------|-----|----------------|\n",
    "|USC00044534|   107.655| 36.0042|  -119.96|     73.2|   CA|  KETTLEMAN CITY|\n",
    "|USC00356784|   0.61097| 42.7519|-124.5011|     12.8|   OR|PORT ORFORD NO 2|\n",
    "|USC00243581|   1316.54| 47.1064|-104.7183|    632.8|   MT|        GLENDIVE|\n",
    "|USC00205601|   685.501|   41.75| -84.2167|    247.2|   MI|         MORENCI|\n",
    "|USC00045853|   34.2294| 37.1364|-121.6025|    114.3|   CA|         MORGAN HILL|\n",
    "\n",
    "### Downloading stations dataframe from S3:\n",
    "Download from S3:\n",
    "> curl https://mas-dse-open.s3.amazonaws.com/Weather/Weather_Stations.tgz > `data_dir`/Weather_Stations.tgz  \n",
    "\n",
    "Untar. Creates a parquet directory:  \n",
    "> tar -xzf `data_dir`/Weather_Stations.tgz  \n",
    "\n",
    "Read parquest file into dataframe:\n",
    "> stations_df=sqlContext.read.parquet(`data_dir`/Weather_Stations.parquet)\n",
    "\n",
    "Print first 4 rows in stations_df dataframe\n",
    "> stations_df.show(4)                "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-04-11T03:26:17.001160Z",
     "start_time": "2018-04-11T03:26:16.996236Z"
    }
   },
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-04-11T03:26:17.001160Z",
     "start_time": "2018-04-11T03:26:16.996236Z"
    }
   },
   "source": [
    "## Statistics file\n",
    "\n",
    "A file with the name `data_dir/STAT_NY.pickle` is pickle file containing the statistics computed for the state of NY. \n",
    "\n",
    "To download file from S3, use\n",
    "> curl https://mas-dse-open.s3.amazonaws.com/Weather/by_state_2/STAT_NY.pickle.gz \\> data_dir/STAT_NY.pickle.gz  \n",
    "\n",
    "To unzip the file use  \n",
    "> gunzip data_dir/STAT_NY.pickle.gz  \n",
    "\n",
    "The pickle file contains a pair: `(STAT,STAT_Descriptions)`. \n",
    "* `STAT` contains the calculated statistics as a dictionary. \n",
    "* `STAT_Descriptions` contains a human-readable description of each element of the dictionary `STAT`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The content of `STAT_DESCRIPTION\n",
    "```\n",
    "   Name  \t                 Description             \t  Size\n",
    "--------------------------------------------------------------------------------\n",
    "SortedVals\t                        Sample of values\tvector whose length varies between measurements\n",
    "     UnDef\t      sample of number of undefs per row\tvector whose length varies between measurements\n",
    "      mean\t                              mean value\t()\n",
    "       std\t                                     std\t()\n",
    "    low100\t                               bottom 1%\t()\n",
    "   high100\t                                  top 1%\t()\n",
    "   low1000\t                             bottom 0.1%\t()\n",
    "  high1000\t                                top 0.1%\t()\n",
    "         E\t                   Sum of values per day\t(365,)\n",
    "        NE\t                 count of values per day\t(365,)\n",
    "      Mean\t                                    E/NE\t(365,)\n",
    "         O\t                   Sum of outer products\t(365, 365)\n",
    "        NO\t               counts for outer products\t(365, 365)\n",
    "       Cov\t                                    O/NO\t(365, 365)\n",
    "       Var\t  The variance per day = diagonal of Cov\t(365,)\n",
    "    eigval\t                        PCA eigen-values\t(365,)\n",
    "    eigvec\t                       PCA eigen-vectors\t(365, 365)\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-04-11T03:26:17.001160Z",
     "start_time": "2018-04-11T03:26:16.996236Z"
    }
   },
   "source": [
    "## Reconstruction file\n",
    "\n",
    "Stored in files named `recon_<state>_<measurement>.parquet`\n",
    "#### Fields:\n",
    "1. **Station** :  Station ID\n",
    "21. **State** :  The state in which the station resides\n",
    "22. **Name** :  The name of the station\n",
    "17. **Dist_coast** :  Distance from Coast (units unclear)\n",
    "18. **Latitude** :  of station\n",
    "19. **Longitude** :  of station\n",
    "20. **Elevation** :  Elevation of station in Meters\n",
    "2. **Measurement** :  Type of measurement (TMAX, PRCP,...)\n",
    "3. **Values** :  A byte array with all of the value (365X2 bytes)\n",
    "4. **Year** :  The Year\n",
    "5. **coeff_1** :  The coefficient of the 1st eigenvector\n",
    "6. **coeff_2** :  The coefficient of the 2nd eigenvector\n",
    "7. **coeff_3** :  The coefficient of the 3rd eigenvector\n",
    "8. **coeff_4** :  The coefficient of the 4th eigenvector\n",
    "9. **coeff_5** :  The coefficient of the 5th eigenvector\n",
    "16. **total_var** : The total variance (square distance from the mean. \n",
    "15. **res_mean** :  The residual variance after subtracting the mean.\n",
    "10. **res_1** :  The residual variance after subtracting the mean and eig1 \n",
    "11. **res_2** :  The residual variance after subtracting the mean and eig1-2\n",
    "12. **res_3** :  The residual variance after subtracting the mean and eig1-3 \n",
    "13. **res_4** :  The residual variance after subtracting the mean and eig1-4 \n",
    "14. **res_5** :  The residual variance after subtracting the mean and eig1-5 "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "hide_input": false,
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
